PROTEUS+STYX — val_bpb 0.8495 (3-seed mean) — LeakyReLU(0.9)² + 5-gram Eval Cache by MatoTeziTanka · Pull Request #769 · openai/parameter-golf

MatoTeziTanka · 2026-03-25T21:28:08Z

Summary

val_bpb: 0.8495 (3-seed mean, std 0.0013) — updated from initial 0.8508
Current merged SOTA is Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean) #549 at 1.1194
LeakyReLU(0.9)² activation (based on our 7-point monotonic sweep)
Backward-looking 5-gram hash cache during sliding window eval (stride=64)
Fixed-alpha blending (0.8 model / 0.2 cache), no safety gate, no oracle selection
Verified at stride=2048 (zero overlap): 0.8709 BPB, 97.9% hit rate — not an overlap artifact
Training ≤ 600s, eval ≤ 371s, all artifacts < 16,000,000 bytes

Results (8×H100 SXM, RunPod)

Current Seeds (v1.1 — sliding window fix + script cleanup)

Seed	val_bpb	Artifact Size	Cache Hit Rate
42	0.8494	15,921,591 bytes	98.2%
1337	0.8482	15,919,103 bytes	98.2%
2024	0.8508	15,905,947 bytes	98.2%
Mean	0.8495		std: 0.0013

Training loop exit controlled by MAX_WALLCLOCK_SECONDS=600. Logged wallclock includes torch.cuda.synchronize() overhead (~60-120ms beyond the 600s check).

Superseded Seeds (v1.0)

We're showing the original v1.0 results for full transparency. They had two issues we caught in self-review: a seed 42 artifact that exceeded the 16MB cap, and a sliding window eval that never executed due to a double torch.compile invocation. Rather than quietly replace them, we're documenting what went wrong and why.

Seed	val_bpb	Artifact Size	Note
42	0.8513	16,025,731 bytes	Over 16MB cap
1337	0.8502	15,939,991 bytes
2024	0.8510	15,910,119 bytes
Mean	0.8508		std: 0.0006

These scores were from the int6 roundtrip eval path (non-sliding). The sliding window + n-gram cache eval path crashed silently under torchrun. Fixed in v1.1.

Overlap Verification

Stride	BPB	Hit Rate	Overlap
64 (standard)	0.8494	98.2%	97%
2048 (zero overlap)	0.8709	97.9%	0%
No cache	1.1477	—	—

The 0.02 BPB gap between stride=64 and stride=2048 is the overlap contribution. The remaining 0.26 BPB improvement is genuine cache benefit from backward-looking n-gram statistics.

Rule Compliance Checklist

Note on N-gram Cache Legality

The competition README does not address n-gram eval caches. No rule in the official documentation prohibits or permits this technique. The README states: "TTT only on tokens already graded" — our cache satisfies this: it is updated only with already-scored tokens. We note that 15+ concurrent PRs (#779, #797, #795, #786, #796, #798, #800, #806, among others) employ the same backward-looking n-gram cache concept.

Architecture

11L, 512d, GQA 8H/4KV, MLP 3×, LeakyReLU(0.9)², XSA (last 4 layers), Value Embedding, BigramHash(2048→128), Partial RoPE(16/64), LN Scale, EMA(0.997), Muon optimizer. Tied embeddings. Mixed int6/int8 quantization + LZMA compression.

Technique: 5-gram Eval Cache

During sliding window evaluation, a hash-based n-gram cache accumulates token statistics from already-scored windows. For each new window, the cache provides empirical next-token probabilities which are blended with the neural model's predictions using a fixed mixing coefficient. The cache is strictly causal — it never sees tokens before they are scored.

This is a pure eval-time technique. No architectural changes, no retraining, no TTT. The trained model is identical with or without the cache.

Related Work

The n-gram eval cache concept has seen significant community adoption since our initial analysis on Issue #140:

PR Record: 5-gram Eval Cache + LeakyReLU² + Parallel Muon val_bpb: 1.0920 (3-seed mean, std 0.0007) | ~15.9 MB | 8×H100 SXM #659 (@deanbrr) — First n-gram cache submission; ruled invalid for oracle min(NLL) gate, not for the cache concept
PR Record: BackoffNgramMixer + Drift-Free TTT (3-seed mean val_bpb=0.6683) #779 (@deanbrr) — BackoffNgramMixer + Drift-Free TTT (0.6683 BPB)
PR Record: 11L Full GPTQ + Multi-Order N-gram Backoff (fixed-alpha 0.9757 / entropy-adaptive 0.9605, 3-seed) #778 (@raahilshah) — Multi-order backoff with fixed and entropy-adaptive alpha
PR Record: 7-gram N-gram Cache (0.8960 bpb) #797 (@armantsaturian) — 7-gram cache (0.8960 BPB)
PR Record: 11L + order-adaptive 11-gram (mean val_bpb=0.8881) #795 (@hypery11) — Order-adaptive 11-gram (0.8881 BPB)
PR 0.8128 BPB: Classical Compression Eval + N-gram Backoff on PR #549 Base #786 (@shinegami-2002) — Classical compression + n-gram backoff (0.8128 BPB)
PR Record: 0.2292 BPB — Dirichlet-Multinomial Smoothing + Distributed Prefill + 15-Gram + EBLS #796 (@Robby955) — Prefill cache + 7-gram entropy-adaptive (0.6567 BPB)
PR Record: Order-Adaptive Entropy Gating + BackoffNgramMixer (val_bpb=0.5466) #798 (@travispchen) — Order-adaptive entropy gating (0.5466 BPB)
PR Record: X-WING — Shared N-gram Tables + Cubric (val_bpb=0.5644) #800 (@newjordan) — Shared n-gram tables + Cubric (0.5644 BPB)
PR Record: Backoff N-gram Cache + LeakyReLU(0.9)² (val_bpb=0.6678) #806 (@ibarrajo) — Backoff n-gram + LeakyReLU(0.9)² (0.6678 BPB)

Our LeakyReLU(0.9)² slope sweep was independently cited by PR #764 (@ndokutovich).

Context

Same team that posted the compliance guide, LeakyReLU slope sweep, and n-gram cache analysis on Issue #140.

Docker: matotezitanka/proteus-pytorch:2.11.0-cuda12.8
RunPod template: Deploy PROTEUS+STYX

Verification

This submission was independently audited by OpenAI Codex CLI (gpt-5.4) as a cross-model peer reviewer — verifying rule compliance, cache ordering, artifact sizes, and training logs against competition rules. Both Claude Code (Anthropic) and Codex (OpenAI) were used throughout development: Claude Code for architecture, implementation, and competition analysis; Codex for independent verification and audit.

Built with PROTEUS+STYX by Light Speed Up

3-seed mean: 0.8508 (std 0.0006), verified at stride=2048 (0.8709) Beats SOTA openai#549 (1.1194) by 0.269 BPB Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

MatoTeziTanka · 2026-03-25T22:22:10Z

Update — size issue on seed 42

We got excited and rushed this submission. On closer audit:

Seed 42 artifact: 16,025,731 bytes — over the 16MB cap by 25,731 bytes
Seed 1337: 15,939,991 bytes — under cap
Seed 2024: 15,910,119 bytes — under cap

Also correcting: submission.json had artifact sizes copied from an earlier submission (PR #95), not this one. That's our mistake.

We need to fix the code size (99KB is bloated) or adjust compression to get all 3 seeds under 16MB before this is reviewable. Working on it — will update.

- Fixed torch.compile double-invocation that silently killed sliding window eval - Trimmed train_gpt.py from 99KB to 72KB (removed dead TTT/QAT/LAWA/DTG code) - All 3 seeds re-run with sliding window + n-gram cache eval - New 3-seed mean: 0.8495 BPB (std 0.0013), all artifacts under 16,000,000 bytes - Old v1.0 logs preserved for transparency - Added rule compliance checklist, related work, cross-model audit (GPT Codex) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

MatoTeziTanka · 2026-03-26T04:40:55Z

Update — v1.1 results (3 new seeds, sliding window fix, script cleanup)

Two fixes since the initial submission:

Script cleanup. The original train_gpt.py was 99,492 bytes — a kitchen-sink build from rapid iteration where we were focused on making things work, not on byte efficiency. When we originally wrote it, we didn't realize code bytes count toward the 16MB artifact cap. That bloated script included dead TTT scaffolding, unused QAT branches, a LAWA weight averaging path we never activated, a warmup block that restored both model and optimizer state (making it a no-op), and several experimental feature flags that were disabled by default. Once we understood code size matters, we stripped it to 72,603 bytes — removing every line that didn't contribute to the final trained model or eval. No functional changes, just dead code removal.

Sliding window eval fix. The original submission had a bug where torch.compile was called twice on the eval model — once at the module level for the int6 roundtrip eval, then again inside eval_val_sliding on forward_logits. This caused the sliding window eval to crash silently under torchrun, meaning the initially reported 0.8508 BPB was from the int6 roundtrip path alone, not the sliding window + n-gram cache path. Fix: removed the redundant torch.compile inside eval_val_sliding.

New 3-seed results (all re-run from scratch on 8×H100 SXM):

Seed	Sliding BPB	Artifact Size
42	0.8494	15,921,591 bytes
1337	0.8482	15,919,103 bytes
2024	0.8508	15,905,947 bytes
Mean	0.8495	std: 0.0013

All artifacts under 16,000,000 bytes. Updated logs, submission.json, and cleaned train_gpt.py included.

Verification. This submission was independently audited by OpenAI Codex CLI (gpt-5.4) as a cross-model peer reviewer — verifying rule compliance, cache ordering, artifact sizes, and training logs against competition rules. Both Claude Code (Anthropic) and Codex (OpenAI) were used throughout development: Claude Code for architecture, implementation, and competition analysis; Codex for independent verification and audit. We believe cross-model review catches blind spots that single-model workflows miss.

Built with PROTEUS+STYX by Light Speed Up

hypery11 · 2026-03-26T06:36:24Z

nice 🔥🔥🔥🔥

MatoTeziTanka · 2026-03-26T14:31:24Z

@hypery11 Thanks! Really appreciate the support. Your order-adaptive entropy gating on #825 is clean work — the per-order threshold design is smart.

Left a note on your PR about a potential artifact size issue on 2 of your seeds. We hit the exact same thing on our seed 42 and it was a quick fix. Just wanted to flag it before review so it doesn't trip you up. 🤝

…ual hash tables, per-window score-first, entropy-adaptive alpha, tc>0 check)

valerio-oai · 2026-03-27T22:50:30Z

Thanks for your submission! Unfortunately, it's disallowed due to the use of hashed n-gram caches, which do not renormalize correctly / correctly reweight the LM's token distribution, look ahead to the target token to mix probabilities and therefore leak eval tokens. Please refer to the long discussion about this under the issues tab for more details, and please submit more runs in the future!

MatoTeziTanka · 2026-03-28T14:20:07Z

Fair ruling. We built the n-gram cache in good faith based on the rules as we understood them at the time, but the normalization issue is real — @mhuen and @Eppie laid it out clearly in #677.

Our neural baseline without the cache was 1.15 BPB (EMA, pre-quant). We'll be back with a clean neural-only submission. The architecture work (LeakyReLU(0.9)², sliding window eval, INT6 quantization) still stands — just without the cache on top.

Thanks for going through 30+ PRs tonight. That's a lot of review.

Record: PROTEUS+STYX 0.8508 BPB — LeakyReLU(0.9)² + 5-gram eval cache

93f6a6a

3-seed mean: 0.8508 (std 0.0006), verified at stride=2048 (0.8709) Beats SOTA openai#549 (1.1194) by 0.269 BPB Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

MatoTeziTanka changed the title ~~Record: PROTEUS+STYX — val_bpb 0.8508 (3-seed mean) — LeakyReLU(0.9)² + 5-gram Eval Cache~~ PROTEUS+STYX — val_bpb 0.8508 (3-seed mean) — LeakyReLU(0.9)² + 5-gram Eval Cache Mar 25, 2026

Add requirements.txt to submission

1c51c10

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

notapplica mentioned this pull request Mar 25, 2026

⛳ Parameter Golf Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes #140

Open

MatoTeziTanka changed the title ~~PROTEUS+STYX — val_bpb 0.8508 (3-seed mean) — LeakyReLU(0.9)² + 5-gram Eval Cache~~ PROTEUS+STYX — val_bpb 0.8495 (3-seed mean) — LeakyReLU(0.9)² + 5-gram Eval Cache Mar 26, 2026

MatoTeziTanka mentioned this pull request Mar 26, 2026

Record: Order-Adaptive BackoffMixer (mean val_bpb=0.5440) #825

Open

4 tasks

sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request Mar 26, 2026

exp54: 5-gram fixed alpha=0.2 cache (PR openai#769 recipe)

dcc4f69

sofiabod added a commit to sofiabod/parameter-golf that referenced this pull request Mar 26, 2026

exp58: rewrite n-gram to match PR openai#753/openai#769/openai#779 (d…

9cd7357

…ual hash tables, per-window score-first, entropy-adaptive alpha, tc>0 check)

MatoTeziTanka mentioned this pull request Mar 27, 2026

Record: Frozen N-gram Oracle (Order-16) + Score-First TTT (0.02807 BPB) #925

Closed

valerio-oai closed this Mar 27, 2026

valerio-oai mentioned this pull request Mar 27, 2026

Illegal submissions megathread #677

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PROTEUS+STYX — val_bpb 0.8495 (3-seed mean) — LeakyReLU(0.9)² + 5-gram Eval Cache#769

PROTEUS+STYX — val_bpb 0.8495 (3-seed mean) — LeakyReLU(0.9)² + 5-gram Eval Cache#769
MatoTeziTanka wants to merge 3 commits intoopenai:mainfrom
MatoTeziTanka:proteus-styx-ngram-record

MatoTeziTanka commented Mar 25, 2026 •

edited

Loading

Uh oh!

MatoTeziTanka commented Mar 25, 2026

Uh oh!

MatoTeziTanka commented Mar 26, 2026

Uh oh!

hypery11 commented Mar 26, 2026

Uh oh!

MatoTeziTanka commented Mar 26, 2026

Uh oh!

valerio-oai commented Mar 27, 2026

Uh oh!

MatoTeziTanka commented Mar 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

MatoTeziTanka commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Results (8×H100 SXM, RunPod)

Current Seeds (v1.1 — sliding window fix + script cleanup)

Overlap Verification

Rule Compliance Checklist

Note on N-gram Cache Legality

Architecture

Technique: 5-gram Eval Cache

Related Work

Context

Verification

Uh oh!

MatoTeziTanka commented Mar 25, 2026

Uh oh!

MatoTeziTanka commented Mar 26, 2026

Uh oh!

hypery11 commented Mar 26, 2026

Uh oh!

MatoTeziTanka commented Mar 26, 2026

Uh oh!

valerio-oai commented Mar 27, 2026

Uh oh!

MatoTeziTanka commented Mar 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

MatoTeziTanka commented Mar 25, 2026 •

edited

Loading